Predicting Stroke using Machine Learning

Fall 2024 Data Science Project

August Joergensen (121317247), Noah Ryu Nguyen (121329011)


Contributions:¶

A: Project Idea: We contributed equally to the project idea. Our shared interest in health and medicine inspired the focus on predicting stroke using machine learning.

B: Dataset Curation and Preprocessing: August worked on converting all metrics into a machine learning-compatible format, ensuring the dataset was structured for modeling. Noah contributed to the preprocessing by dealing with outliers, standardization, and Random Forest-based imputation for missing values of smoking status.

C: Data Exploration and Summary Statistics: Noah implemented SMOTE to address class imbalance, ensuring that the minority class (stroke cases) was better represented in the dataset.

D: ML Algorithm Design/Development: August designed and implemented the main Random Forest classifier, incorporating cross-validation and hyperparameter tuning to optimize recall.

E: ML Algorithm Training and Test Data Analysis: August conducted model training and testing. Noah compared the models using statistical tests and provided insights into the performance differences between the models.

F: Visualization, Result Analysis, Conclusion: We collaborated to generate visualizations, analyze the feature importance, and write the conclusion. Our joint efforts ensured clarity and made sure we both understood the material and results.

G: Final Tutorial Report Creation: We worked together to draft, refine, and finalize the report, ensuring it was well-structured and polished.

Introduction¶

Stroke is a leading cause of death and long-term disability worldwide, making its timely detection and prevention a critical area of medical research. In this analysis, we explore the application of machine learning models to predict the likelihood of stroke based on patient data, with a focus on addressing the challenges posed by class imbalance. Specifically, we aim to answer: Can machine learning models effectively identify stroke cases, even when the dataset is heavily skewed towards non-stroke cases? Additionally, we investigate whether machine learning can reveal which factors are most critical in determining the likelihood of stroke.

Answering these questions is vital because accurate stroke prediction can significantly enhance early intervention efforts, potentially saving lives and reducing healthcare costs. By leveraging techniques like Synthetic Minority Oversampling (SMOTE) and recall-focused evaluation metrics, this work seeks to develop a model that prioritizes the detection of stroke cases while identifying key predictive factors. This approach addresses the critical need for minimizing false negatives in this high-stakes domain and supports data-driven decision-making in stroke prevention and management.

Data Curation¶

In this notebook we will go through the preprocessing of the data and basic data exploration. Our dataset, "healthcare-dataset-stroke-data.csv", contains patient characteristics along with whether each patient had a stroke.

The dataset can be found here: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

In [1]:
# Core libraries
import numpy as np
import pandas as pd

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.subplots as sp
import plotly

plotly.io.renderers.default = "notebook"
plotly.offline.init_notebook_mode()

# Scikit-learn modules
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    train_test_split, KFold, GridSearchCV, StratifiedKFold, cross_val_score
)
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    classification_report, accuracy_score, recall_score, confusion_matrix,
    roc_auc_score, roc_curve
)
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Statistical tests and utilities
from scipy.stats import levene, chisquare
from statsmodels.stats.contingency_tables import mcnemar
from imblearn.over_sampling import SMOTE
from scipy.stats import gaussian_kde
from collections import Counter
In [2]:
# seeding for reproducibility
np.random.seed(42)

Let's have a look at the data and see what we are working with. We can import the data as a pandas dataframe and use the head() function to see the first few rows of the data.

In [3]:
# Import dataset and overview
dataset = pd.read_csv('healthcare-dataset-stroke-data.csv')
dataset.head(20)
Out[3]:
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
0 9046 Male 67.0 0 1 Yes Private Urban 228.69 36.6 formerly smoked 1
1 51676 Female 61.0 0 0 Yes Self-employed Rural 202.21 NaN never smoked 1
2 31112 Male 80.0 0 1 Yes Private Rural 105.92 32.5 never smoked 1
3 60182 Female 49.0 0 0 Yes Private Urban 171.23 34.4 smokes 1
4 1665 Female 79.0 1 0 Yes Self-employed Rural 174.12 24.0 never smoked 1
5 56669 Male 81.0 0 0 Yes Private Urban 186.21 29.0 formerly smoked 1
6 53882 Male 74.0 1 1 Yes Private Rural 70.09 27.4 never smoked 1
7 10434 Female 69.0 0 0 No Private Urban 94.39 22.8 never smoked 1
8 27419 Female 59.0 0 0 Yes Private Rural 76.15 NaN Unknown 1
9 60491 Female 78.0 0 0 Yes Private Urban 58.57 24.2 Unknown 1
10 12109 Female 81.0 1 0 Yes Private Rural 80.43 29.7 never smoked 1
11 12095 Female 61.0 0 1 Yes Govt_job Rural 120.46 36.8 smokes 1
12 12175 Female 54.0 0 0 Yes Private Urban 104.51 27.3 smokes 1
13 8213 Male 78.0 0 1 Yes Private Urban 219.84 NaN Unknown 1
14 5317 Female 79.0 0 1 Yes Private Urban 214.09 28.2 never smoked 1
15 58202 Female 50.0 1 0 Yes Self-employed Rural 167.41 30.9 never smoked 1
16 56112 Male 64.0 0 1 Yes Private Urban 191.61 37.5 smokes 1
17 34120 Male 75.0 1 0 Yes Private Urban 221.29 25.8 smokes 1
18 27458 Female 60.0 0 0 No Private Urban 89.22 37.8 never smoked 1
19 25226 Male 57.0 0 1 No Govt_job Urban 217.08 NaN Unknown 1

Now let's look at the types of features

In [4]:
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB

From this we can make several interesting observations. First, multiple features are strings (objects); second, the bmi column has missing values. We will have to address both issues before we can start building our models.

Exploratory Data Analysis¶

Gender column from string to binary values¶

We begin by converting the gender strings to binary values. We will assign 1 to Male and 0 to Female.

In [5]:
# Here we make the gender column binary
dataset['gender'] = dataset['gender'].map({'Male': 1, 'Female': 0})
dataset.head()
Out[5]:
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
0 9046 1.0 67.0 0 1 Yes Private Urban 228.69 36.6 formerly smoked 1
1 51676 0.0 61.0 0 0 Yes Self-employed Rural 202.21 NaN never smoked 1
2 31112 1.0 80.0 0 1 Yes Private Rural 105.92 32.5 never smoked 1
3 60182 0.0 49.0 0 0 Yes Private Urban 171.23 34.4 smokes 1
4 1665 0.0 79.0 1 0 Yes Self-employed Rural 174.12 24.0 never smoked 1
In [6]:
dataset['gender'].unique()
Out[6]:
array([ 1.,  0., nan])

We see that there is a NaN value in the gender column. Let's see how many NaN values there are and what those rows look like.

In [7]:
nan_gender_rows = dataset[dataset['gender'].isna()]
print(nan_gender_rows)
         id  gender   age  hypertension  heart_disease ever_married work_type  \
3116  56156     NaN  26.0             0              0           No   Private   

     Residence_type  avg_glucose_level   bmi   smoking_status  stroke  
3116          Rural             143.33  22.4  formerly smoked       0  

Since it is a single occurrence, we will simply drop the row with the NaN value.

In [8]:
dataset.dropna(subset=['gender'], inplace=True)

Converting 'ever_married' column to binary values¶

We see that the 'ever_married' column is also a string. We will convert it to binary values. We will assign 1 to Yes and 0 to No.

In [9]:
dataset['ever_married'] = dataset['ever_married'].map({'Yes': 1, 'No': 0})
In [10]:
dataset['ever_married'].unique()
Out[10]:
array([1, 0])

Converting 'work_type' column to integers¶

In [11]:
dataset['work_type'].unique()
Out[11]:
array(['Private', 'Self-employed', 'Govt_job', 'children', 'Never_worked'],
      dtype=object)

We will assign a number to each work type. We will assign 0 to 'Private', 1 to 'Self-employed', 2 to 'Govt_job', 3 to 'children' and 4 to 'Never_worked'.

In [12]:
dataset['work_type'] = dataset['work_type'].map({'Private': 0, 'Self-employed': 1, 'Govt_job': 2, 'children': 3, 'Never_worked': 4})

Converting 'Residence_type' column to binary values¶

In [13]:
dataset['Residence_type'].unique()
Out[13]:
array(['Urban', 'Rural'], dtype=object)

We will assign 1 to Urban and 0 to Rural.

In [14]:
dataset['Residence_type'] = dataset['Residence_type'].map({'Urban': 1, 'Rural': 0})

Converting 'smoking_status' column to integers¶

In [15]:
dataset['smoking_status'].unique()
Out[15]:
array(['formerly smoked', 'never smoked', 'smokes', 'Unknown'],
      dtype=object)

We will assign a number to each smoking status. We will assign 0 to 'never smoked', 1 to 'formerly smoked', 2 to 'smokes' and 3 to 'Unknown'.

In [16]:
dataset['smoking_status'] = dataset['smoking_status'].map({'never smoked': 0, 'formerly smoked': 1, 'smokes': 2, 'Unknown': 3})

dataset.head()
Out[16]:
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
0 9046 1.0 67.0 0 1 1 0 1 228.69 36.6 1 1
1 51676 0.0 61.0 0 0 1 1 0 202.21 NaN 0 1
2 31112 1.0 80.0 0 1 1 0 0 105.92 32.5 0 1
3 60182 0.0 49.0 0 0 1 0 1 171.23 34.4 2 1
4 1665 0.0 79.0 1 0 1 1 0 174.12 24.0 0 1
In [17]:
dataset.dtypes
Out[17]:
id                     int64
gender               float64
age                  float64
hypertension           int64
heart_disease          int64
ever_married           int64
work_type              int64
Residence_type         int64
avg_glucose_level    float64
bmi                  float64
smoking_status         int64
stroke                 int64
dtype: object

We have now converted all the string columns to numerical values.

Cleaning for NaN values¶

Next up we will be looking at the potential missing values in the data set. We will use the isna() function to check for missing values in the data set.

In [18]:
dataset.isna().sum()
Out[18]:
id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

Lots of missing values in the bmi column. Let's check what percentage of the data is missing.

In [19]:
# Here we check the percentage of missing values in the BMI column
total_rows = len(dataset)
missing_bmi_rows = dataset['bmi'].isna().sum()
missing_bmi_percentage = (missing_bmi_rows / total_rows) * 100
print(f"Percentage of missing BMI data: {missing_bmi_percentage:.2f}%")
Percentage of missing BMI data: 3.93%

Since this is a significant amount of data, we need to handle it. We will fill the missing values with the mean of the bmi column, but add normally distributed random noise (with standard deviation equal to that of the observed values) so the imputed values spread realistically instead of piling up at the mean.

In [20]:
bmi_mean = dataset['bmi'].mean()
bmi_std = dataset['bmi'].std()

# The noise will have a mean of 0 and a standard deviation equal to the BMI std
noise = np.random.normal(0, bmi_std, size=dataset['bmi'].isna().sum())

# Creating mask for missing values
mask = dataset['bmi'].isna()

# Imputing missing values
dataset.loc[mask, 'bmi'] = bmi_mean + noise


print(f"Missing BMI data after imputation: {dataset['bmi'].isna().sum()}")
Missing BMI data after imputation: 0

Let's visualize the now-complete bmi column to inspect its distribution, and also to confirm that filling in the missing values did not change the distribution of the data.

In [21]:
# Data for histogram
bmi_data = dataset['bmi']

# Create histogram data (raw counts)
n_bins = 81  # Adjust this as needed
hist = np.histogram(bmi_data, bins=n_bins)
bins = hist[1]
bin_centers = (bins[:-1] + bins[1:]) / 2
hist_values = hist[0]

# Calculate bin width
bin_width = bins[1] - bins[0]

# Create a KDE from the data
kde = gaussian_kde(bmi_data) # KDE is a function that takes a value and returns the density estimate

# Evaluate the KDE over a range of x values
x = np.linspace(bmi_data.min(), bmi_data.max(), 500)
kde_values = kde(x)

# Scale the KDE to match the histogram counts
kde_scaled = kde_values * len(bmi_data) * bin_width

# Create the plot
fig = go.Figure()

# Add histogram
fig.add_trace(
    go.Bar(
        x=bin_centers,
        y=hist_values,
        name="BMI Histogram",
        marker=dict(color="blue", line=dict(color="black", width=1)),
        opacity=0.6,
    )
)

# Add KDE curve
fig.add_trace(
    go.Scatter(
        x=x,
        y=kde_scaled,
        mode="lines",
        name="KDE",
        line=dict(color="red", width=2),
    )
)

# Layout settings
fig.update_layout(
    title="BMI Distribution with KDE Overlay",
    xaxis_title="BMI",
    yaxis_title="Count",
    template="plotly_white",
    showlegend=True,
)

fig.show()

Now we see that the bmi distribution still fits a normal distribution, even though we imputed all the missing values around the mean with added noise. Very nice!
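The same idea can be checked numerically: mean-plus-noise imputation should leave the column's mean and standard deviation essentially unchanged. A sketch on synthetic BMI-like data (the numbers below are illustrative, not our dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic BMI-like column with 200 of 5000 values missing (illustrative)
bmi = pd.Series(rng.normal(loc=28.5, scale=7.7, size=5000))
bmi[rng.choice(5000, size=200, replace=False)] = np.nan

# Moments of the observed (non-missing) values
mean, std = bmi.mean(), bmi.std()
mask = bmi.isna()

# Mean-plus-noise imputation, as done above
bmi.loc[mask] = mean + rng.normal(0, std, size=mask.sum())

# The imputed column's moments stay close to the observed ones
print(abs(bmi.mean() - mean) < 0.5, abs(bmi.std() - std) < 0.5)
```

With only ~4% of values imputed, the shift in both moments is negligible, which matches what the histogram and KDE overlay show.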

Outlier Detection¶

Outlier detection is an important step in data preprocessing. Outliers can significantly affect the performance of machine learning models. We will use the IQR method to detect outliers in the numerical columns of the dataset.

We have three features that are non-binary: bmi, age, and avg_glucose_level. We will check these columns for potential outliers.

In [22]:
fig = make_subplots(rows=1, cols=3, subplot_titles=('BMI', 'Age', 'Avg Glucose Level'))

# Adding a boxplot for BMI 
fig.add_trace(
    go.Box(y=dataset['bmi'], name='BMI', marker_color='skyblue', 
           boxpoints='outliers', 
           marker=dict(outliercolor='red', color='skyblue')), 
    row=1, col=1
)

# Adding a boxplot for Age 
fig.add_trace(
    go.Box(y=dataset['age'], name='Age', marker_color='lightgreen', 
           boxpoints='outliers', 
           marker=dict(outliercolor='orange', color='lightgreen')), 
    row=1, col=2
)

# Adding a boxplot for Avg Glucose Level 
fig.add_trace(
    go.Box(y=dataset['avg_glucose_level'], name='Avg Glucose Level', marker_color='salmon', 
           boxpoints='outliers', 
           marker=dict(outliercolor='purple', color='salmon')), 
    row=1, col=3
)

# Update layout to adjust subplot titles and spacing
fig.update_layout(title_text='Boxplots for BMI, Age, and Avg Glucose Level',
                  showlegend=False,
                  height=500, width=1000)

# Show plot
fig.show()

The Age feature seems fine; however, BMI and Avg Glucose Level have many extreme values (the data points lying outside the whiskers), so we will have to decide whether these are outliers or genuine cases of extreme values.

First we will examine BMI and compare the outliers against their age values to check for implausible pairs, e.g. a baby cannot have a BMI of, say, 100.

In [23]:
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = dataset['bmi'].quantile(0.25)
Q3 = dataset['bmi'].quantile(0.75)

# Calculate IQR (Interquartile Range)
IQR = Q3 - Q1

# Define lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detect outliers in BMI
outliers_bmi = dataset[(dataset['bmi'] < lower_bound) | (dataset['bmi'] > upper_bound)]

# Print out the potential BMI outliers
print("Potential BMI Outliers:")
print(outliers_bmi[['age', 'bmi']])

# Scatter plot to visualize outliers with age
fig = px.scatter(outliers_bmi, x='age', y='bmi', 
                 title='Extreme BMI Values vs Age',
                 labels={'bmi': 'BMI', 'age': 'Age'},
                 color='age')

fig.show()
Potential BMI Outliers:
       age   bmi
21    52.0  48.9
113   45.0  56.6
254   47.0  50.1
258   74.0  54.6
270   57.0  60.9
...    ...   ...
4858  43.0  47.6
4906  53.0  54.1
4952  51.0  56.6
5009  50.0  49.5
5057  49.0  47.6

[113 rows x 2 columns]

We first identified extreme BMI values by detecting outliers using the IQR method. The outliers are those BMI values that fall outside of the range defined by the interquartile range (IQR), which captures the middle 50% of the data. By plotting these outliers against the age values, we can assess whether certain BMI values are unusually high or low relative to age. This is important because, for example, a very young child should not have an excessively high BMI. In this scatter plot, the color represents the age, helping us visualize how BMI outliers vary across different age groups.

Here we see, for example, a 17-year-old with a BMI of 97.6. This is implausibly high for a 17-year-old, so we will consider it an outlier.

We decide to remove the IQR-flagged outliers that lie more than 2 standard deviations from the mean of those outliers. We believe this limit is reasonable and will not remove too many data points.

In [24]:
Q1 = dataset['bmi'].quantile(0.25)
Q3 = dataset['bmi'].quantile(0.75)
IQR = Q3 - Q1

lower_bound_iqr = Q1 - 1.5 * IQR
upper_bound_iqr = Q3 + 1.5 * IQR

# Detect outliers based on IQR
outliers_bmi_iqr = dataset[(dataset['bmi'] < lower_bound_iqr) | (dataset['bmi'] > upper_bound_iqr)]

# Now, remove the points that are beyond 2 STDV from the mean of the outliers
mean_bmi_outliers = outliers_bmi_iqr['bmi'].mean()
std_bmi_outliers = outliers_bmi_iqr['bmi'].std()

# 2 STDV bounds for the detected IQR-based outliers
lower_bound_2stdv = mean_bmi_outliers - 2 * std_bmi_outliers
upper_bound_2stdv = mean_bmi_outliers + 2 * std_bmi_outliers

# Filter the IQR-based outliers to remove those beyond 2 STDV
filtered_outliers = outliers_bmi_iqr[(outliers_bmi_iqr['bmi'] >= lower_bound_2stdv) & 
                                     (outliers_bmi_iqr['bmi'] <= upper_bound_2stdv)]

print("Filtered BMI Outliers after removing 2 STDV:")
print(filtered_outliers[['age', 'bmi']])

fig = px.scatter(filtered_outliers, x='age', y='bmi', 
                 title='Filtered Extreme BMI Values (after removing 2 STDV) vs Age',
                 labels={'bmi': 'BMI', 'age': 'Age'},
                 color='age')

fig.show()
Filtered BMI Outliers after removing 2 STDV:
       age   bmi
21    52.0  48.9
113   45.0  56.6
254   47.0  50.1
258   74.0  54.6
270   57.0  60.9
...    ...   ...
4858  43.0  47.6
4906  53.0  54.1
4952  51.0  56.6
5009  50.0  49.5
5057  49.0  47.6

[108 rows x 2 columns]

The filtered outliers are plotted against the age values, similar to the previous plot, but now we see only those points that are within a more reasonable range. By comparing the filtered extreme BMI values to the age values, we can gain further insights into whether these BMI values are still unusual given the individual's age.

If we look at the chart we can now see that all the most extreme values are removed, i.e. the 17-year-old with a BMI of 97.6 is no longer in the dataset.

In [25]:
dataset = dataset[(dataset['bmi'] >= lower_bound_iqr) & (dataset['bmi'] <= upper_bound_iqr) | 
                  ((dataset['bmi'] >= lower_bound_2stdv) & (dataset['bmi'] <= upper_bound_2stdv))]

print(f"Updated dataset size after removing outliers: {len(dataset)}")
dataset.head()
Updated dataset size after removing outliers: 5104
Out[25]:
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
0 9046 1.0 67.0 0 1 1 0 1 228.69 36.600000 1 1
1 51676 0.0 61.0 0 0 1 1 0 202.21 32.795912 0 1
2 31112 1.0 80.0 0 1 1 0 0 105.92 32.500000 0 1
3 60182 0.0 49.0 0 0 1 0 1 171.23 34.400000 2 1
4 1665 0.0 79.0 1 0 1 1 0 174.12 24.000000 0 1

Now let's look at the glucose levels and do the same

In [26]:
Q1 = dataset['avg_glucose_level'].quantile(0.25)
Q3 = dataset['avg_glucose_level'].quantile(0.75)

IQR = Q3 - Q1

# define lower and upper bounds for outliers based on avg_glucose_level
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# detect outliers in avg_glucose_level
outliers_glucose = dataset[(dataset['avg_glucose_level'] < lower_bound) | (dataset['avg_glucose_level'] > upper_bound)]

print("Potential Avg Glucose Level Outliers:")
print(outliers_glucose[['age', 'avg_glucose_level']])

# scatter plot to visualize avg_glucose_level outliers with age
fig = px.scatter(outliers_glucose, x='age', y='avg_glucose_level', 
                 title='Extreme Avg Glucose Level Values vs Age',
                 labels={'avg_glucose_level': 'Avg Glucose Level', 'age': 'Age'},
                 color='age')

fig.show()
Potential Avg Glucose Level Outliers:
       age  avg_glucose_level
0     67.0             228.69
1     61.0             202.21
3     49.0             171.23
4     79.0             174.12
5     81.0             186.21
...    ...                ...
5061  41.0             223.78
5062  82.0             211.58
5063  39.0             179.38
5064  70.0             193.88
5076  34.0             174.37

[625 rows x 2 columns]

This chart again shows the outliers in the glucose levels. We can see the extreme values of Avg Glucose Level and compare them to the age values to see if they are reasonable or not.

Generally the glucose levels are more spread out, making it harder to distinguish outliers from genuine values. This may be because some people suffer from diabetes and have very high glucose levels. Still, since standard deviations are relative to the data, removing values that lie more than 2 standard deviations from the mean of the flagged outliers should again be a reasonable limit.

In [27]:
# Here, we will remove the outliers based on the IQR method and then filter out the points that are beyond 2 STDV from the mean of the outliers

Q1 = dataset['avg_glucose_level'].quantile(0.25)
Q3 = dataset['avg_glucose_level'].quantile(0.75)
IQR = Q3 - Q1

lower_bound_iqr = Q1 - 1.5 * IQR
upper_bound_iqr = Q3 + 1.5 * IQR

outliers_glucose_iqr = dataset[(dataset['avg_glucose_level'] < lower_bound_iqr) | (dataset['avg_glucose_level'] > upper_bound_iqr)]

mean_glucose_outliers = outliers_glucose_iqr['avg_glucose_level'].mean()
std_glucose_outliers = outliers_glucose_iqr['avg_glucose_level'].std()

lower_bound_2stdv = mean_glucose_outliers - 2 * std_glucose_outliers
upper_bound_2stdv = mean_glucose_outliers + 2 * std_glucose_outliers

filtered_outliers = outliers_glucose_iqr[(outliers_glucose_iqr['avg_glucose_level'] >= lower_bound_2stdv) & 
                                         (outliers_glucose_iqr['avg_glucose_level'] <= upper_bound_2stdv)]

dataset = dataset[~dataset.index.isin(outliers_glucose_iqr[~outliers_glucose_iqr.index.isin(filtered_outliers.index)].index)]

print(f"Number of outliers remaining after removing 2 STDV: {len(filtered_outliers)}")
outliers_removed = len(outliers_glucose_iqr) - len(filtered_outliers)
print(f"Number of outliers removed based on 2 STDV: {outliers_removed}")

fig = px.scatter(filtered_outliers, x='age', y='avg_glucose_level', 
                 title='Filtered Extreme Avg Glucose Level Values (after removing 2 STDV) vs Age',
                 labels={'avg_glucose_level': 'Avg Glucose Level', 'age': 'Age'},
                 color='age')

fig.show()
Number of outliers remaining after removing 2 STDV: 604
Number of outliers removed based on 2 STDV: 21

The final dataset was updated to reflect the removal of these outliers, ensuring more accurate and reliable analysis moving forward. We see that many extreme values are still present in the glucose levels, but we believe that these are not outliers but actual values.

In [28]:
fig = make_subplots(rows=1, cols=3, subplot_titles=('BMI', 'Age', 'Avg Glucose Level'))

# Adding a boxplot for BMI 
fig.add_trace(
    go.Box(y=dataset['bmi'], name='BMI', marker_color='skyblue', 
           boxpoints='outliers', 
           marker=dict(outliercolor='red', color='skyblue')), 
    row=1, col=1
)

# Adding a boxplot for Age 
fig.add_trace(
    go.Box(y=dataset['age'], name='Age', marker_color='lightgreen', 
           boxpoints='outliers', 
           marker=dict(outliercolor='orange', color='lightgreen')), 
    row=1, col=2
)

# Adding a boxplot for Avg Glucose Level 
fig.add_trace(
    go.Box(y=dataset['avg_glucose_level'], name='Avg Glucose Level', marker_color='salmon', 
           boxpoints='outliers', 
           marker=dict(outliercolor='purple', color='salmon')), 
    row=1, col=3
)

# Update layout to adjust subplot titles and spacing
fig.update_layout(title_text='Boxplots for BMI, Age, and Avg Glucose Level',
                  showlegend=False,
                  height=500, width=1000)

# Show plot
fig.show()

This is the result of the outlier removal process. We have removed the most extreme values from the dataset, ensuring that the data is more representative and suitable for machine learning analysis.

The biggest change in the data set was the removal of the extreme BMI values. The glucose levels were not as affected by the outlier removal process.

Variances of the features¶

Let's have a look at the variances of the features, age, avg_glucose_level and bmi

In [29]:
print(dataset[['age', 'avg_glucose_level', 'bmi']].var())
age                   510.969734
avg_glucose_level    1969.372147
bmi                    58.833108
dtype: float64

Unfortunately we see a big difference in variances, but let's first confirm this with a hypothesis test. Here we choose a Levene test, since we are comparing the variances of multiple groups.

We are conducting a hypothesis test to determine whether the variances of the three features (age, avg_glucose_level, and bmi) are significantly different.

Null Hypothesis (H₀):¶

The variances of the features are equal. $H_0: \sigma_{\text{age}}^2 = \sigma_{\text{avg\_glucose\_level}}^2 = \sigma_{\text{bmi}}^2$

Alternative Hypothesis (H₁):¶

At least one feature has a variance that is different from the others. $H_1: \text{At least one variance is different}$

Test:¶

We will use Levene's Test to test for equality of variances.

  • If the p-value < 0.05, we will reject the null hypothesis, indicating that the variances are significantly different.
  • If the p-value ≥ 0.05, we fail to reject the null hypothesis, suggesting there is no significant difference in variances.
In [30]:
stat, p_value = levene(dataset['age'], dataset['avg_glucose_level'], dataset['bmi'], center='mean')
print(f"Levene's test statistic: {stat}")
print(f"p-value: {p_value}")
Levene's test statistic: 2441.9532774953777
p-value: 0.0

We see that the p-value is less than 0.05, so we reject the null hypothesis. This means that the variances are significantly different. We will have to standardize the data before we can build our models.

Let's try and visualize the data to see how they appear in different ranges.

In [31]:
# Select specific numerical columns
selected_cols = ['age', 'avg_glucose_level', 'bmi']

# Melt the dataset for easier plotting 
df_melted = dataset.reset_index()[['index'] + selected_cols].melt(id_vars=['index'], 
                                                                  value_vars=selected_cols, 
                                                                  var_name='Feature', 
                                                                  value_name='Value')

# Plot 
fig = px.line(df_melted, x='index', y='Value', color='Feature',
              title='Line Plot of Selected Numerical Features',
              labels={'index': 'Observations', 'Value': 'Values'})

# Show the plot
fig.show()

This is how the data looks without having standardized it. We see that the features are on different scales.

Now we can try and standardize the data. We will use the StandardScaler from sklearn to do this. This will make the features have a mean of 0 and a standard deviation of 1.

Recall that the formula that the StandardScaler uses is:

$$z = \frac{x - \mu}{\sigma}$$

Where $z$ is the standardized value, $x$ is the original value, $\mu$ is the mean of the feature and $\sigma$ is the standard deviation of the feature.
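As a sanity check on this formula, StandardScaler's output matches the z-score computed by hand. A quick sketch on a toy array (illustrative only, not our dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single toy feature column
x = np.array([[10.0], [20.0], [30.0], [40.0]])

# StandardScaler uses the population standard deviation (ddof=0) internally
z_manual = (x - x.mean()) / x.std(ddof=0)
z_scaler = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_scaler))  # True
```

One subtlety worth noting: sklearn divides by the population standard deviation (ddof=0), whereas `pandas.Series.std` defaults to the sample standard deviation (ddof=1), so a hand computation with pandas defaults will differ slightly.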

In [32]:
# Split the dataset into features and target
selected_cols = ['age', 'avg_glucose_level', 'bmi']

# Standardize the selected columns
scaler = StandardScaler()

# Fit and transform the data
scaled_data = scaler.fit_transform(dataset[selected_cols])

# Replace the original columns with the scaled data
dataset[selected_cols] = scaled_data

# Melt the dataset for easier plotting
df_melted = dataset.reset_index()[['index'] + selected_cols].melt(id_vars=['index'], 
                                                                  value_vars=selected_cols, 
                                                                  var_name='Feature', 
                                                                  value_name='Scaled Value')

# Plot using Plotly Express
fig = px.line(df_melted, x='index', y='Scaled Value', color='Feature',
              title='Line Plot of Scaled Selected Numerical Features',
              labels={'index': 'Observations', 'Scaled Value': 'Scaled Values'})

# Show the plot
fig.show()

After standardization, we see that the features are now on the same scale.

'Unknown' values in the 'smoking_status' column¶

A problem we encountered in the data set was that the 'smoking_status' column had a value called 'Unknown'. We will have to deal with this issue, since we believe that the smoking status of a patient is a very informative feature and needs to be included in the model.

Let's first take a look at the distribution of the smoking status.

In [33]:
# Mapping dictionary
smoking_status_map = {'never smoked': 0, 'formerly smoked': 1, 'smokes': 2, 'Unknown': 3}

# Reverse the mapping dictionary
reverse_smoking_status_map = {v: k for k, v in smoking_status_map.items()}

# Print most frequent smoking status with original names and their percentages
value_counts = dataset['smoking_status'].value_counts()
total_count = len(dataset['smoking_status'])

# Calculate the longest string lengths for alignment
status_length = max(len(reverse_smoking_status_map[status]) for status in value_counts.index)
count_length = max(len(str(count)) for count in value_counts.values)
percentage_length = 6  # For percentages with 2 decimal points and '%' symbol

# Now we can print the numbers in an organized way
print(f"{'Status':<{status_length}} | {'Count':>{count_length}} | {'Percentage':>{percentage_length}}")
print('-' * (status_length + count_length + percentage_length + 6))

for status, count in value_counts.items():
    percentage = (count / total_count) * 100
    print(f"{reverse_smoking_status_map[status]:<{status_length}} | {count:>{count_length}} | {percentage:>{percentage_length}.2f}%")
Status          | Count | Percentage
-------------------------------
never smoked    | 1885 |  37.08%
Unknown         | 1538 |  30.26%
formerly smoked |  877 |  17.25%
smokes          |  783 |  15.40%

We see that there are a lot of unknown values: 30.26% of all observations!

The presence of "Unknown" in the smoking status variable poses a challenge for accurate data analysis. These "Unknown" values act similarly to missing data (like NaN) and can distort the relationships between variables. In the original dataset, the "Unknown" category creates uncertainty, as it doesn't provide any useful information about the smoking habits of these individuals. This makes it difficult to assess the true correlations involving smoking status.

Imputation offers one solution by predicting these missing values based on patterns in the data. However, this can introduce bias if the predicted values aren't fully accurate. Another approach is to remove the "Unknown" values entirely, but this reduces the dataset's size and may impact the overall quality of analysis.

Addressing this issue is crucial for building reliable models and ensuring the correlations between variables remain meaningful.

We are going to try each approach and see which one gives us the best results.

Below we will try to predict the missing values using a Random Forest Classifier, using the other features to predict 'smoking_status'. We drop the 'id' column, since it carries no predictive information, and the 'smoking_status' column itself, since it is the target we are trying to predict.

In [34]:
# Here we are training a Random Forest Classifier to predict the 'Unknown' smoking statuses

# Filter out the 'Unknown' smoking statuses (where smoking_status == 3)
filtered_data = dataset[dataset['smoking_status'] != 3]

# Drop the 'id' column and the 'smoking_status' column (target variable)
X = filtered_data.drop(['id', 'smoking_status'], axis=1)
y = filtered_data['smoking_status']

# Initialize KFold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Initialize lists to store evaluation metrics for each fold
accuracy_scores = []
classification_reports = []

# Perform KFold cross-validation
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Training fold {fold + 1}...")

    # Split the data into training and testing sets for this fold
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Train the RandomForestClassifier
    rf_classifier = RandomForestClassifier(n_estimators=1000, random_state=42, class_weight='balanced')
    rf_classifier.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = rf_classifier.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)
    print(f"Accuracy for fold {fold + 1}: {accuracy:.2f}")

    # Store the classification report for this fold
    report = classification_report(y_test, y_pred, target_names=['never smoked', 'formerly smoked', 'smokes'], output_dict=True)
    classification_reports.append(report)

# Calculate and display the average accuracy across all folds
avg_accuracy = np.mean(accuracy_scores)
print(f"\nAverage accuracy across all folds: {avg_accuracy:.2f}")


# Now that the model is validated, we will retrain on the entire dataset to predict the 'Unknown' smoking statuses

# Train on the entire filtered dataset
rf_classifier.fit(X, y)

# Predict smoking status for the 'Unknown' cases
unknown_data = dataset[dataset['smoking_status'] == 3]
X_unknown = unknown_data.drop(['id', 'smoking_status'], axis=1)
predicted_smoking_status = rf_classifier.predict(X_unknown)

# Create a new dataset with these predictions
dataset_rf_imputed = dataset.copy()
dataset_rf_imputed.loc[dataset_rf_imputed['smoking_status'] == 3, 'smoking_status'] = predicted_smoking_status

print("Smoking statuses imputed for unknown cases.")
Training fold 1...
Accuracy for fold 1: 0.50
Training fold 2...
Accuracy for fold 2: 0.48
Training fold 3...
Accuracy for fold 3: 0.47
Training fold 4...
Accuracy for fold 4: 0.47
Training fold 5...
Accuracy for fold 5: 0.53
Training fold 6...
Accuracy for fold 6: 0.51
Training fold 7...
Accuracy for fold 7: 0.46
Training fold 8...
Accuracy for fold 8: 0.51
Training fold 9...
Accuracy for fold 9: 0.55
Training fold 10...
Accuracy for fold 10: 0.49

Average accuracy across all folds: 0.50
Smoking statuses imputed for unknown cases.

After training a Random Forest Classifier to predict smoking statuses ("never smoked," "formerly smoked," "smokes"), the model achieved an average cross-validation accuracy of 50%. This is only modest predictive performance on a three-class problem, but the classifier may still be picking up genuine patterns rather than guessing. Based on this model, we then imputed the "Unknown" smoking statuses by predicting their most likely categories with the trained classifier.

To evaluate the quality of this imputation, we compare the distribution of the imputed smoking statuses against a baseline: the marginal distribution of smoking statuses observed in the full dataset (37.08% "never smoked", 30.26% "Unknown", 17.25% "formerly smoked", 15.40% "smokes").

We use a chi-square goodness-of-fit test to assess whether the distribution of imputed smoking statuses differs significantly from this baseline distribution. A significant result would suggest that the imputation model has captured patterns beyond simply reproducing the overall class frequencies.

Hypotheses for the Chi-Square Goodness-of-Fit Test

Null Hypothesis (H₀):¶

The observed distribution of imputed smoking statuses does not differ significantly from the baseline distribution.

Alternative Hypothesis (H₁):¶

The observed distribution of imputed smoking statuses differs significantly from the baseline distribution.

In [35]:
# Total number of "Unknown" cases
total_unknown = len(predicted_smoking_status)

# Baseline: marginal distribution of smoking statuses in the full dataset
baseline_distribution = [37.08, 17.25, 15.40, 30.26]  # percentages for never/formerly/smokes/Unknown
baseline_counts = [total_unknown * (percentage / 100) for percentage in baseline_distribution]  # Convert to absolute counts

# Normalize the baseline counts to ensure the sum matches the total_unknown
baseline_counts_normalized = [count * (total_unknown / sum(baseline_counts)) for count in baseline_counts]

# Distribution of predicted smoking statuses (absolute counts)
imputed_counts = pd.Series(predicted_smoking_status).value_counts()
imputed_counts_full = [imputed_counts.get(i, 0) for i in range(4)]  # Ensure all categories are represented

# Perform chi-square goodness-of-fit test
chi2_stat, p_value = chisquare(f_obs=imputed_counts_full, f_exp=baseline_counts_normalized)

# Print results
print("Chi-Square Test Results:")
print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

if p_value < 0.05:
    print("The distribution of imputed values significantly differs from the baseline (p < 0.05).")
else:
    print("The distribution of imputed values does not significantly differ from the baseline (p >= 0.05).")
Chi-Square Test Results:
Chi-Square Statistic: 1529.2069
P-Value: 0.0000
The distribution of imputed values significantly differs from the baseline (p < 0.05).

The results of the chi-square test show that the distribution of the imputed values differs significantly from the baseline distribution, with a highly significant p-value (p < 0.05) and a chi-square statistic of 1529.21. This suggests that the model is not merely reproducing the overall class frequencies but is instead leveraging patterns in the other features to make more informed predictions.

Now we can create the dataset entirely without the 'Unknown' values in the 'smoking_status' column.

In [36]:
dataset_removed = dataset[dataset['smoking_status'] != 3]

We compare three versions of our dataset to determine which is best for analysis. The first is the original dataset, which includes all the raw data but has some smoking statuses labeled as "Unknown." While these aren’t technically missing values, "Unknown" acts similarly to missing data (like NaN) in this case. The second is the imputed dataset, where "Unknown" smoking statuses are predicted using a Random Forest model, allowing us to retain more data while preserving relationships between variables. The third version removes all rows with "Unknown" smoking status, reducing the data size but eliminating uncertainty.

By comparing the correlation matrices, we assess how each approach affects the relationships between variables and choose the most reliable dataset for further analysis.

In [37]:
# Drop the 'id' column from each dataset
dataset_no_id = dataset.drop(columns=['id'])
dataset_rf_imputed_no_id = dataset_rf_imputed.drop(columns=['id'])
dataset_removed_no_id = dataset_removed.drop(columns=['id'])

# Compute the correlation matrices
corr_dataset = dataset_no_id.corr()
corr_rf_imputed = dataset_rf_imputed_no_id.corr()
corr_removed = dataset_removed_no_id.corr()

# Plot the correlation matrices side by side
fig, axes = plt.subplots(1, 3, figsize=(24, 8))

sns.heatmap(corr_dataset, annot=True, fmt='.2f', ax=axes[0], cmap='coolwarm')
axes[0].set_title('Correlation Matrix for original dataset')

sns.heatmap(corr_rf_imputed, annot=True, fmt='.2f', ax=axes[1], cmap='coolwarm')
axes[1].set_title('Correlation Matrix for dataset RF imputed')

sns.heatmap(corr_removed, annot=True, fmt='.2f', ax=axes[2], cmap='coolwarm')
axes[2].set_title('Correlation Matrix for dataset unknown removed')

plt.tight_layout()
plt.show()
[Figure: side-by-side correlation heatmaps for the original, RF-imputed, and 'Unknown'-removed datasets]

We decide to move forward with the random forest imputed dataset because it provides a balanced and comprehensive approach to handling missing data while preserving the integrity of the dataset.

First, by using random forest imputation, we address the missing or "unknown" smoking status values in a way that relies on the relationships between other variables in the data. This is much more effective than simply categorizing these values as "Unknown" or removing them entirely. It allows us to retain more data, which increases the sample size and, in turn, the statistical power of our analysis.

Another important reason for choosing this approach is that it preserves the natural structure and relationships within the dataset. Imputation through random forests uses patterns from existing data to predict missing values, which means the dataset remains internally consistent and closer to reality. This increases the accuracy of any insights or models that rely on this data.

Additionally, the imputed dataset retains roughly 1,500 more observations than the dataset with the "Unknown" rows removed, ensuring we don’t lose valuable information. The more data we can use, the better our chances of building robust models and drawing meaningful conclusions.

When we compared the correlation matrices, we noticed that the imputed dataset maintained relationships similar to those in the original dataset, showing that the imputation process didn’t disrupt key correlations. In contrast, removing the "Unknown" rows shifted some relationships, likely due to the reduced sample size.

In short, choosing the random forest imputed dataset means we keep a more complete dataset, preserve essential correlations, and ensure that missing values are handled thoughtfully. This approach provides a solid foundation for our analysis, leading to more reliable and accurate results.

In [38]:
dataset_rf_imputed.head(20)
Out[38]:
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
0 9046 1.0 1.054748 0 1 1 0 1 2.775212 1.012276 1 1
1 51676 0.0 0.789290 0 0 1 1 0 2.178456 0.516275 0 1
2 31112 1.0 1.629908 0 1 1 0 0 0.008454 0.477693 0 1
3 60182 0.0 0.258373 0 0 1 0 1 1.480287 0.725426 2 1
4 1665 0.0 1.585665 1 0 1 1 0 1.545417 -0.630591 0 1
5 56669 1.0 1.674151 0 0 1 0 1 1.817878 0.021341 1 1
6 53882 1.0 1.364450 1 1 1 0 0 -0.799014 -0.187277 0 1
7 10434 0.0 1.143234 0 0 0 0 1 -0.251387 -0.787054 0 1
8 27419 0.0 0.700804 0 0 1 0 0 -0.662445 -0.134003 1 1
9 60491 0.0 1.541422 0 0 1 0 1 -1.058630 -0.604513 0 1
10 12109 0.0 1.674151 1 0 1 0 0 -0.565991 0.112611 0 1
11 12095 0.0 0.789290 0 1 1 2 0 0.336129 1.038353 2 1
12 12175 0.0 0.479588 0 0 1 0 1 -0.023321 -0.200316 2 1
13 8213 1.0 1.541422 0 1 1 0 1 2.575767 0.670888 1 1
14 5317 0.0 1.585665 0 1 1 0 1 2.446185 -0.082968 0 1
15 58202 0.0 0.302616 1 0 1 1 0 1.394199 0.269075 0 1
16 56112 1.0 0.922019 0 1 1 0 1 1.939573 1.129624 2 1
17 34120 1.0 1.408693 1 0 1 0 1 2.608445 -0.395895 2 1
18 27458 0.0 0.745047 0 0 0 0 1 -0.367898 1.168740 0 1
19 25226 1.0 0.612318 0 1 0 2 1 2.513568 1.567321 0 1

This is what the data looks like after we have imputed the missing values in the 'smoking_status' column.

SMOTE for Class Imbalance¶

From experience we know that class imbalance can lead to poor model performance, so before we start developing our models we need to examine the target variable.

Let's take a look at its distribution.

In [39]:
# Prepare the data for the histogram
stroke_counts = dataset_rf_imputed['stroke'].value_counts()

# Create the Plotly bar plot
fig = px.bar(
    x=stroke_counts.index,
    y=stroke_counts.values,
    text=stroke_counts.values,
    color=stroke_counts.index,
    title="Distribution of Stroke Classes",
    labels={'x': 'Class (0 = No Stroke, 1 = Stroke)', 'y': 'Count'},
    color_continuous_scale='Viridis'
)

# Customize the layout
fig.update_traces(
    textposition='outside',
    marker=dict(line=dict(width=1, color='Black'))  # Add border to bars
)
fig.update_layout(
    xaxis=dict(
        tickmode='array',
        tickvals=[0, 1],
        ticktext=["No Stroke", "Stroke"]
    ),
    yaxis_title="Count",
    xaxis_title="Class (0 = No Stroke, 1 = Stroke)",
    title_font_size=15
)

# Show the figure
fig.show()
In [40]:
# Non-altered dataset
X_original = dataset_rf_imputed.drop(['id', 'stroke'], axis=1)
y_original = dataset_rf_imputed['stroke']

# Perform train-test split
X_train_original, X_test_original, y_train_original, y_test_original = train_test_split(
    X_original, y_original, test_size=0.2, random_state=42, stratify=y_original
)

# Print the split sizes for verification
print("Original Dataset Split:")
print(f"X_train length: {len(X_train_original)}")
print(f"X_test length: {len(X_test_original)}")

# Count the number of stroke cases (0 and 1) in the training and testing sets
train_counts = y_train_original.value_counts()
test_counts = y_test_original.value_counts()

print("\nClass distribution in the training set:")
print(f"No Stroke (0): {train_counts[0]}")
print(f"Stroke (1): {train_counts[1]}")

print("\nClass distribution in the testing set:")
print(f"No Stroke (0): {test_counts[0]}")
print(f"Stroke (1): {test_counts[1]}")
Original Dataset Split:
X_train length: 4066
X_test length: 1017

Class distribution in the training set:
No Stroke (0): 3870
Stroke (1): 196

Class distribution in the testing set:
No Stroke (0): 968
Stroke (1): 49

We see that the target variable is highly imbalanced. We will have to deal with this before we can build our models. Imbalanced data can lead to poor model performance, especially for classification tasks. To address this issue, we will have to use some kind of oversampling technique. We will use the Synthetic Minority Over-sampling Technique (SMOTE) to create synthetic samples of the minority class (stroke = 1) and balance the dataset.

What is SMOTE?¶

SMOTE works by generating synthetic samples for the minority class rather than duplicating existing samples. It does this by identifying the nearest neighbors of a minority class sample and creating new samples along the line segments connecting the sample to its neighbors. This technique ensures that the dataset becomes balanced while maintaining diversity in the synthetic samples. By doing so, SMOTE improves the classifier's ability to learn patterns from the minority class without overfitting.
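The core interpolation step can be sketched in a few lines (an illustrative toy example, not imblearn's full implementation; the sample points are made up):

```python
# Sketch of SMOTE's core idea: a synthetic point is interpolated between a
# minority-class sample and one of its nearest minority-class neighbours.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Toy minority-class samples (two scaled features, made up for illustration)
X_minority = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])

# For each sample, find its nearest minority neighbour (index 0 is the sample itself)
nn = NearestNeighbors(n_neighbors=2).fit(X_minority)
_, idx = nn.kneighbors(X_minority)

# Pick a sample and interpolate at a random point along the segment to a neighbour
i = 0
neighbor = X_minority[idx[i, 1]]
lam = rng.random()  # interpolation factor in [0, 1)
synthetic = X_minority[i] + lam * (neighbor - X_minority[i])

print(synthetic)  # lies on the segment between X_minority[i] and its neighbour
```

Repeating this for randomly chosen minority samples until the classes are balanced is, in essence, what `SMOTE.fit_resample` does for us.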

Avoiding data leakage¶

Data leakage occurs when information from the test set influences the training process, leading to overly optimistic model performance. When using SMOTE, including the test set during oversampling can result in synthetic samples that inadvertently contain information about the test set. This violates the assumption that the test set is unseen and independent.

To prevent this, SMOTE is applied only to the training set after the train-test split from before. The test set remains untouched, preserving its original class distribution. This ensures that the model is evaluated fairly on a realistic test set, maintaining the integrity of the results.

Read more about SMOTE here: https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html

In [41]:
# Applying SMOTE to balance the training dataset

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE only to the training set
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_original, y_train_original)

# Print the sizes of the resampled training set
print("Training Set After SMOTE:")
print(f"X_train length: {len(X_train_resampled)}")
print(f"y_train length: {len(y_train_resampled)}")

# Count the number of stroke (0 and 1) in the resampled training set
resampled_counts = y_train_resampled.value_counts()
print("\nClass distribution in the resampled training set:")
print(f"No Stroke (0): {resampled_counts[0]}")
print(f"Stroke (1): {resampled_counts[1]}")

# Count the number of stroke (0 and 1) in the untouched test set
test_counts = y_test_original.value_counts()
print("\nClass distribution in the testing set (untouched):")
print(f"No Stroke (0): {test_counts[0]}")
print(f"Stroke (1): {test_counts[1]}")

# Visualize the resampled training set distribution using Plotly
import plotly.express as px

fig = px.bar(
    x=resampled_counts.index,
    y=resampled_counts.values,
    text=resampled_counts.values,
    color=resampled_counts.index,
    title="Distribution of Stroke Classes in Training Set (after SMOTE)",
    labels={'x': 'Class (0 = No Stroke, 1 = Stroke)', 'y': 'Count'},
    color_continuous_scale='Viridis'
)

# Customize the layout
fig.update_traces(
    textposition='outside',
    marker=dict(line=dict(width=1, color='Black'))  # Add border to bars
)
fig.update_layout(
    xaxis=dict(
        tickmode='array',
        tickvals=[0, 1],
        ticktext=["No Stroke", "Stroke"]
    ),
    yaxis_title="Count",
    xaxis_title="Class (0 = No Stroke, 1 = Stroke)",
    title_font_size=15
)

# Show the figure
fig.show()
Training Set After SMOTE:
X_train length: 7740
y_train length: 7740

Class distribution in the resampled training set:
No Stroke (0): 3870
Stroke (1): 3870

Class distribution in the testing set (untouched):
No Stroke (0): 968
Stroke (1): 49

The plot shows that after applying SMOTE, the training set is now perfectly balanced. Both classes (No Stroke and Stroke) have an equal number of samples (3870 each). This ensures that the model will have equal representation of both classes during training, addressing the imbalance issue present in the original dataset.

Since we have now balanced the dataset we can move on to building our models.

Primary analysis¶

Based on the results of the exploration, predicting whether a person is at risk of having a stroke is naturally framed as a binary classification problem. Logistic regression is a strong candidate for this task due to its simplicity and interpretability, particularly for binary outcomes. However, we chose to use a random forest classifier instead. This decision was driven by its robustness and ability to handle mixed data types effectively, such as the combination of categorical and continuous variables present in the dataset. Random forests also excel at capturing complex, non-linear relationships between features, which may be critical for accurately predicting stroke risk. Additionally, they provide built-in mechanisms for assessing feature importance, offering valuable insights into the factors most strongly associated with stroke risk.

In health and medical applications, the primary objective when training machine learning models is often to maximize recall. Recall is critical in these settings because failing to identify true positive cases, or false negatives, can have severe consequences. However, while prioritizing recall, it is essential to ensure the model generalizes well to unseen data and avoids both overfitting and underfitting.
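As a quick refresher, recall = TP / (TP + FN): of all true stroke cases, the fraction the model actually flags. A toy check (labels are made up for illustration):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 1 = stroke, 0 = no stroke
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))                # 2 / (2 + 2) = 0.5
print(recall_score(y_true, y_pred))  # same value via sklearn
```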

To achieve this balance, a nested cross-validation strategy is used. This approach involves an inner loop and an outer loop, each with distinct roles. The inner loop focuses on hyperparameter tuning, where grid search is employed to systematically explore combinations of hyperparameters for the Random Forest model, such as the number of estimators, maximum depth, and minimum samples required for splits or leaves. During this process, the model is evaluated using a recall-focused scoring metric, ensuring it is optimized for identifying true positive cases. By performing this hyperparameter tuning independently within each training fold of the outer loop, the risk of data leakage and overfitting to the entire dataset is minimized.

The outer loop serves as the model's final evaluation. For each fold, the data is split into a training set, used for grid search and hyperparameter optimization, and a test set, which remains unseen until the evaluation stage. After the inner loop identifies the best hyperparameters for a particular fold, the corresponding model is evaluated on the outer test set. This ensures that the recall score reflects the model’s performance on completely unseen data, providing a robust estimate of its generalizability. By separating hyperparameter optimization from final evaluation, nested cross-validation avoids overly optimistic recall estimates and ensures a fair assessment of the model.

After all folds in the outer loop are completed, the overall performance is summarized by calculating the mean recall across the folds. Additionally, the best hyperparameter configurations from each fold are analyzed to determine the most frequently selected parameters, offering insights into the model's optimal setup. This process ensures that the chosen model configuration not only achieves high recall but also demonstrates consistent performance across multiple data splits.

By using this methodology, the model is designed to prioritize recall while maintaining generalizability, making it particularly suitable for health and medical applications where minimizing false negatives is crucial. This robust evaluation framework ensures that the final model is both effective and reliable for deployment in real-world scenarios.

Below, we perform a 5-fold outer cross-validation to evaluate the generalization performance of the model. For each fold in the outer loop, we use a 3-fold inner cross-validation to perform a grid search for hyperparameter tuning of the Random Forest Classifier. The inner loop identifies the optimal hyperparameters for the model (e.g., number of estimators, maximum depth, etc.) by testing multiple configurations, while the outer loop ensures that the evaluation is done on unseen data, providing an unbiased estimate of the model's performance with the selected hyperparameters.

In [42]:
# Define the hyperparameter grid for Random Forest
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create the outer cross-validation strategy
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Create the Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42, class_weight="balanced")

# Create the GridSearchCV object for inner cross-validation
grid_search = GridSearchCV(
    estimator=rf_classifier,
    param_grid=param_grid,
    scoring='recall',
    cv=3,  # Inner cross-validation folds
    n_jobs=-1,
    verbose=1
)

# Track the best parameters for each fold
best_params_per_fold = []

# Perform the outer cross-validation
outer_scores = []
for train_idx, test_idx in outer_cv.split(X_train_resampled, y_train_resampled):
    # Split the resampled data into training and validation sets
    X_train_outer, X_test_outer = X_train_resampled.iloc[train_idx], X_train_resampled.iloc[test_idx]
    y_train_outer, y_test_outer = y_train_resampled.iloc[train_idx], y_train_resampled.iloc[test_idx]
    
    # Fit GridSearchCV on the training fold
    grid_search.fit(X_train_outer, y_train_outer)
    
    # Evaluate the best model on the outer test set
    best_model = grid_search.best_estimator_
    y_pred_outer = best_model.predict(X_test_outer)
    recall_score_outer = recall_score(y_test_outer, y_pred_outer)
    
    # Append the recall score
    outer_scores.append(recall_score_outer)
    
    # Append the best parameters for this fold
    best_params_per_fold.append(grid_search.best_params_)
    
    # Print results for this fold
    print(f"Outer CV Fold Results:")
    print(f"Best Parameters: {grid_search.best_params_}")
    print(f"Recall on Outer Test Fold: {recall_score_outer:.4f}")
    print("-" * 50)

# Print the overall results
print("Final Cross-Validation Results:")
print(f"Mean Recall Across Outer Folds: {sum(outer_scores) / len(outer_scores):.4f}")
print(f"Recall Scores for Each Fold: {outer_scores}")

# Print the most frequent best parameter combination
most_frequent_params = Counter([str(params) for params in best_params_per_fold]).most_common(1)
print("\nMost Frequent Best Parameters Across Folds:")
print(most_frequent_params[0][0])  # Print the most frequent parameter set
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Outer CV Fold Results:
Best Parameters: {'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Recall on Outer Test Fold: 0.9806
--------------------------------------------------
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Outer CV Fold Results:
Best Parameters: {'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Recall on Outer Test Fold: 0.9858
--------------------------------------------------
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Outer CV Fold Results:
Best Parameters: {'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 150}
Recall on Outer Test Fold: 0.9690
--------------------------------------------------
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Outer CV Fold Results:
Best Parameters: {'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
Recall on Outer Test Fold: 0.9638
--------------------------------------------------
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Outer CV Fold Results:
Best Parameters: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 150}
Recall on Outer Test Fold: 0.9483
--------------------------------------------------
Final Cross-Validation Results:
Mean Recall Across Outer Folds: 0.9695
Recall Scores for Each Fold: [np.float64(0.9806201550387597), np.float64(0.9857881136950905), np.float64(0.9689922480620154), np.float64(0.9638242894056848), np.float64(0.9483204134366925)]

Most Frequent Best Parameters Across Folds:
{'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}

The results of the outer cross-validation indicate that the Random Forest model achieved consistently high recall scores across all folds, with a mean recall of 0.9695. Each fold's best hyperparameters are shown, and the most frequently selected configuration includes:

  • max_depth: 15
  • min_samples_leaf: 1
  • min_samples_split: 2
  • n_estimators: 100

An interesting observation is that n_estimators = 100 and n_estimators = 150 each appear twice among the five folds. Occam's razor suggests that when two models perform comparably, we should prefer the simpler one, so we choose n_estimators = 100.

The individual recall scores for the outer test folds vary slightly, ranging from 0.9483 to 0.9858, but all remain above 0.94, demonstrating the model's robustness and ability to generalize effectively.

Now that we have a well-tuned model, we can evaluate it on the original, untouched test set. Below we take the best estimator found by the grid search (the model from the final outer fold, refit with its best hyperparameters on that fold's training data) and measure how well it performs on realistic, imbalanced data it has never seen.

In [43]:
# Use the best estimator from the final GridSearchCV (already refit on its training fold)
rf_classifier = grid_search.best_estimator_

# Evaluate the model on the original test set
y_pred_original = rf_classifier.predict(X_test_original)

# Print classification report
print("Classification Report on Original Test Set:")
print(classification_report(y_test_original, y_pred_original))

# Print confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test_original, y_pred_original))

# Calculate and print ROC-AUC score
roc_auc = roc_auc_score(y_test_original, rf_classifier.predict_proba(X_test_original)[:, 1])
print(f"\nROC-AUC Score: {roc_auc:.4f}")
Classification Report on Original Test Set:
              precision    recall  f1-score   support

           0       0.96      0.91      0.93       968
           1       0.14      0.29      0.19        49

    accuracy                           0.88      1017
   macro avg       0.55      0.60      0.56      1017
weighted avg       0.92      0.88      0.90      1017


Confusion Matrix:
[[880  88]
 [ 35  14]]

ROC-AUC Score: 0.7918

Let's compare the results of the SMOTE model with a model trained on the original, non-SMOTE data.

In [44]:
# Train a Random Forest Classifier on the original unsmoted training set
rf_classifier_no_smote = RandomForestClassifier(
    random_state=42,
    max_depth=15,
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=100,
    class_weight="balanced"
)

# Fit the model on the original training set
rf_classifier_no_smote.fit(X_train_original, y_train_original)

# Evaluate the model on the original test set
y_pred_no_smote = rf_classifier_no_smote.predict(X_test_original)

# Print classification report
print("Classification Report on Original Test Set (No SMOTE):")
print(classification_report(y_test_original, y_pred_no_smote))

# Print confusion matrix
print("\nConfusion Matrix (No SMOTE):")
print(confusion_matrix(y_test_original, y_pred_no_smote))

# Calculate and print ROC-AUC score
roc_auc_no_smote = roc_auc_score(y_test_original, rf_classifier_no_smote.predict_proba(X_test_original)[:, 1])
print(f"\nROC-AUC Score (No SMOTE): {roc_auc_no_smote:.4f}")
Classification Report on Original Test Set (No SMOTE):
              precision    recall  f1-score   support

           0       0.95      1.00      0.97       968
           1       0.00      0.00      0.00        49

    accuracy                           0.95      1017
   macro avg       0.48      0.50      0.49      1017
weighted avg       0.91      0.95      0.93      1017


Confusion Matrix (No SMOTE):
[[964   4]
 [ 49   0]]

ROC-AUC Score (No SMOTE): 0.8007

Despite achieving high accuracy (95%) and excellent recall for the majority class (non-stroke), the non-SMOTE model failed entirely for the minority class (stroke), with a recall of 0.00 and zero correct stroke predictions. This outcome highlights the severe impact of class imbalance, where the model becomes overly biased toward the majority class and effectively ignores the minority class.

In contrast, the SMOTE-enhanced model, while achieving a lower overall accuracy of 88%, correctly predicted 14 stroke cases. Although its precision for stroke cases was low at 0.14 and its recall modest at 0.29, this represents a significant improvement in the model's ability to detect stroke cases. Notably, the two models' ROC-AUC scores are nearly identical (0.7918 with SMOTE versus 0.8007 without), meaning both rank cases similarly well; the real difference appears at the default 0.5 decision threshold, where the SMOTE-trained model is far more willing to predict the minority class.

By incorporating SMOTE, the model shifts its focus to better identify the minority class, trading some precision for improved recall. In a health setting, where identifying stroke cases is critical, this trade-off is worthwhile, as it ensures more stroke cases are detected, reducing the likelihood of false negatives and improving the model's clinical utility.
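The minority-class numbers quoted above follow directly from the confusion matrix. As a quick sanity check, here is a minimal sketch recomputing them from the printed counts:

```python
import numpy as np

# Confusion matrix for the SMOTE model on the original test set:
# rows are actual [no stroke, stroke], columns are predicted.
cm = np.array([[880, 88],
               [ 35, 14]])

tn, fp, fn, tp = cm.ravel()
recall = tp / (tp + fn)      # share of actual strokes detected
precision = tp / (tp + fp)   # share of stroke predictions that are correct

print(f"Stroke recall:    {recall:.2f}")     # 0.29
print(f"Stroke precision: {precision:.2f}")  # 0.14
```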

Statistical Difference between the Models¶

To statistically compare the performance of the SMOTE-enhanced model and the non-SMOTE model, McNemar's test was performed. This test evaluates whether the differences in incorrect predictions between the two models are statistically significant. McNemar's test is particularly suited for this scenario because it is designed for paired data, where the same test set is used for both models. It focuses on the instances where the models disagree in their predictions—cases where one model is correct, and the other is incorrect. By analyzing these disagreements, the test determines whether the performance difference is due to random chance or a meaningful improvement.

Again, let's formulate the hypotheses for McNemar's test:

Null Hypothesis (H₀):¶

The difference in incorrect predictions between the SMOTE-enhanced model and the non-SMOTE model is due to random chance. In other words, there is no significant difference in the models' performance.

Alternative Hypothesis (H₁):¶

The difference in incorrect predictions between the SMOTE-enhanced model and the non-SMOTE model is statistically significant, indicating that the performance difference is meaningful.

In [45]:
# Confusion matrix for SMOTE model
conf_matrix_smote = confusion_matrix(y_test_original, y_pred_original)

# Confusion matrix for non-SMOTE model
conf_matrix_no_smote = confusion_matrix(y_test_original, y_pred_no_smote)

# Create the 2x2 contingency table for McNemar's test.
# Only the off-diagonal (discordant) counts affect the exact test, so the
# diagonal entries (both models correct / both incorrect) can stay zero.
disagreements = np.zeros((2, 2), dtype=int)

# Model 1 (SMOTE) correct, Model 2 (No-SMOTE) incorrect
disagreements[0, 1] = np.sum((y_pred_original == y_test_original) & (y_pred_no_smote != y_test_original))

# Model 1 (SMOTE) incorrect, Model 2 (No-SMOTE) correct
disagreements[1, 0] = np.sum((y_pred_original != y_test_original) & (y_pred_no_smote == y_test_original))

# Perform McNemar's test
result = mcnemar(disagreements, exact=True)

# Print results
print("McNemar's Test Results:")
print(f"Statistic: {result.statistic}")
print(f"P-Value: {result.pvalue}")

if result.pvalue < 0.05:
    print("There is a significant difference between the SMOTE model and the non-SMOTE model (p < 0.05).")
else:
    print("There is no significant difference between the SMOTE model and the non-SMOTE model (p >= 0.05).")
McNemar's Test Results:
Statistic: 14.0
P-Value: 2.4575158375112673e-13
There is a significant difference between the SMOTE model and the non-SMOTE model (p < 0.05).

The results of McNemar's test revealed a statistically significant difference in performance between the SMOTE-enhanced model and the non-SMOTE model. The test yielded a test statistic of $14.0$ and a p-value of $2.46 \times 10^{-13}$, which is far below the commonly used significance threshold of $0.05$.

This provides strong evidence to reject the null hypothesis that the disagreements between the two models are due to random chance: their error patterns differ significantly. Note that McNemar's test does not by itself say which model is better, only that the two models err on different cases. Here, most of the disagreements are non-stroke cases that the non-SMOTE model labels correctly, while the cases that only the SMOTE model gets right include all 14 detected strokes. Given that detecting strokes is the clinical priority, this significant shift in the error pattern supports applying SMOTE.
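Under the hood, the exact version of McNemar's test is a two-sided binomial test on the discordant pairs. A minimal sketch, using counts implied by the two confusion matrices and the test statistic above (14 cases only the SMOTE model gets right, 84 only the non-SMOTE model gets right):

```python
from scipy.stats import binomtest

# Discordant pairs: under H0 each model is equally likely to "win" one,
# so the counts follow Binomial(n = b + c, p = 0.5).
b = 14  # SMOTE correct, non-SMOTE incorrect
c = 84  # SMOTE incorrect, non-SMOTE correct
result = binomtest(b, n=b + c, p=0.5, alternative="two-sided")

print(f"Exact McNemar p-value: {result.pvalue:.3e}")  # far below 0.05
```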

Visualization¶

In the following, we convey the results of the analysis more visually, starting with the confusion matrices for the models trained with and without SMOTE.

In [46]:
# Confusion Matrix for SMOTE Model
conf_matrix_smote = confusion_matrix(y_test_original, y_pred_original)

conf_matrix_smote_fig = ff.create_annotated_heatmap(
    z=conf_matrix_smote,  # Use the original confusion matrix
    x=['Predicted No Stroke', 'Predicted Stroke'],  # Predicted labels
    y=['Actual No Stroke', 'Actual Stroke'],        # Actual labels
    colorscale='Viridis',  # Use consistent coloring
    showscale=True
)
conf_matrix_smote_fig.update_layout(
    title="Confusion Matrix for SMOTE Model",
    xaxis_title="Predicted Label",
    yaxis_title="Actual Label",
    autosize=True
)
# Reverse the y-axis to align the actual labels correctly
conf_matrix_smote_fig.update_yaxes(autorange='reversed')

# Confusion Matrix for Non-SMOTE Model
conf_matrix_no_smote = confusion_matrix(y_test_original, y_pred_no_smote)

conf_matrix_no_smote_fig = ff.create_annotated_heatmap(
    z=conf_matrix_no_smote,  # Use the original confusion matrix
    x=['Predicted No Stroke', 'Predicted Stroke'],  # Predicted labels
    y=['Actual No Stroke', 'Actual Stroke'],        # Actual labels
    colorscale='Viridis',  # Use consistent coloring
    showscale=True
)
conf_matrix_no_smote_fig.update_layout(
    title="Confusion Matrix for Non-SMOTE Model",
    xaxis_title="Predicted Label",
    yaxis_title="Actual Label",
    autosize=True
)
# Reverse the y-axis for the second confusion matrix as well
conf_matrix_no_smote_fig.update_yaxes(autorange='reversed')

# Show the plots
conf_matrix_smote_fig.show()
conf_matrix_no_smote_fig.show()

The two plots show confusion matrices for two machine learning models predicting strokes. The first plot corresponds to the model trained using SMOTE. According to the confusion matrix, the model correctly identified 880 non-stroke cases and 14 stroke cases. However, it misclassified 88 non-stroke cases as strokes and 35 stroke cases as non-strokes.

The second plot corresponds to the model trained without SMOTE. It correctly identified 964 non-stroke cases, but failed to classify any stroke cases correctly. Additionally, it misclassified 4 non-stroke cases as strokes and 49 stroke cases as non-strokes.

This comparison highlights how SMOTE impacts the model's performance by focusing on the minority class (stroke cases) and attempting to balance the predictions. The model trained with SMOTE performs better in identifying stroke cases, albeit at the cost of misclassifying some non-stroke cases as strokes. In contrast, the model without SMOTE is biased towards the majority class and fails to detect any stroke cases.

Insights¶

Feature Importance Analysis¶

Having trained the model, we can now examine the feature importances. These tell us which features matter most when predicting whether a person is at risk of having a stroke. This is an important step, since it can give us insight into what drives strokes and what we can do to prevent them. The mechanism is impurity reduction: the more a feature reduces uncertainty, or "impurity", in the model's trees, the more important it is.

In scikit-learn's Random Forest classifier, the impurity reduction is computed for every split in every tree. A feature's importance is then the average of its impurity reductions across all trees in the forest, normalized so that the importances sum to 1.
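As a minimal, self-contained sketch (on synthetic data, not our stroke dataset), we can confirm that scikit-learn's mean-decrease-in-impurity importances are normalized:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the normalization holds regardless of dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

importances = rf.feature_importances_
print(importances.round(3))
print("Sum of importances:", importances.sum())  # 1.0 up to float error
```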

In [47]:
# Calculate feature importances
feature_importances = pd.DataFrame({
    'Feature': X_train_resampled.columns,
    'Importance': rf_classifier.feature_importances_
}).sort_values(by='Importance', ascending=False)

feature_importance_fig = go.Figure()

feature_importance_fig.add_trace(
    go.Bar(
        x=feature_importances['Importance'],
        y=feature_importances['Feature'],
        orientation='h',
        marker=dict(color='blue')
    )
)

feature_importance_fig.update_layout(
    title='Feature Importances from Random Forest (Maximizing Recall)',
    xaxis_title='Importance',
    yaxis_title='Features',
    yaxis=dict(autorange='reversed'), 
    template='plotly_white'
)

# Show the plot
feature_importance_fig.show()

The feature importances reveal that age is the most significant predictor of stroke risk. This aligns with medical knowledge, as age is a well-established risk factor for strokes. Additionally, avg_glucose_level and BMI are shown to be important features, which is also expected given that high glucose levels and elevated BMI are recognized contributors to stroke risk.

Conversely, features like heart disease, hypertension, and residence type appear to have lower importance. During our earlier exploratory data analysis (EDA), we observed that the smoker status column had many missing values. To address this, we imputed the missing values using a random forest classifier. However, this process may not have yielded the desired results, or it is possible that smoker status is highly correlated with other features, contributing little unique information to the model.

Since each feature's importance is its normalized share of the total impurity reduction across the forest, the importances should sum to exactly 1. Let's verify this by computing the cumulative sum.

In [48]:
cumulative_importance = np.cumsum(feature_importances['Importance'])
print(cumulative_importance)
1    0.421845
7    0.598479
8    0.758315
5    0.821085
0    0.877884
9    0.921355
4    0.948084
6    0.973062
2    0.989138
3    1.000000
Name: Importance, dtype: float64

The cumulative sum reaches exactly 1, confirming that the importances are normalized as expected.

Let's see how much the top 3 features contribute to the model; this number represents their cumulative impurity reduction.

In [49]:
top_3_features = feature_importances.head(3)
cumulative_importance_top_3 = top_3_features['Importance'].sum()
print(f"The top 3 features are: {top_3_features['Feature'].values}")
print(f'Cumulative importance of top 3 features: {cumulative_importance_top_3}')
The top 3 features are: ['age' 'avg_glucose_level' 'bmi']
Cumulative importance of top 3 features: 0.7583147487305235

The cumulative importance of the top 3 features is 0.76, indicating that these features account for a significant portion of the model's predictive power. This highlights the importance of age, avg_glucose_level, and BMI in determining stroke risk. By focusing on these key features, healthcare professionals can better assess and manage stroke risk in patients, potentially leading to more effective prevention strategies and interventions.

t-SNE and PCA Visualization¶

To gain further insights into the data and the model's performance, we can visualize the dataset using dimensionality reduction techniques like t-SNE and PCA. These methods help us understand the distribution of data points in a lower-dimensional space, revealing patterns and clusters that may not be apparent in the original feature space.

In [50]:
# Perform t-SNE to reduce to 2 components
tsne = TSNE(n_components=2, random_state=42, perplexity=30, max_iter=1000)
X_tsne_original = tsne.fit_transform(X_original)

# Create a DataFrame for Plotly
df_tsne = pd.DataFrame(X_tsne_original, columns=['t-SNE Component 1', 't-SNE Component 2'])
df_tsne['Label'] = y_original

# Plot with Plotly
fig = px.scatter(
    df_tsne,
    x='t-SNE Component 1',
    y='t-SNE Component 2',
    color='Label',
    title='t-SNE Visualization (Original Dataset)',
    labels={'color': 'Stroke'},
    color_continuous_scale='Viridis',
    opacity=0.7
)

fig.show()
In [51]:
# Perform PCA to reduce to 2 components
pca = PCA(n_components=2, random_state=42)
X_pca_original = pca.fit_transform(X_original)

# Create a DataFrame for Plotly
df_pca = pd.DataFrame(X_pca_original, columns=['Principal Component 1', 'Principal Component 2'])
df_pca['Label'] = y_original

# Plot with Plotly
fig = px.scatter(
    df_pca,
    x='Principal Component 1',
    y='Principal Component 2',
    color='Label',
    title='PCA Visualization (Original Dataset)',
    labels={'color': 'Stroke'},
    color_continuous_scale='Viridis',
    opacity=0.7
)

fig.show()
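One quantity the PCA plot above does not show is how much variance the two components actually retain. A minimal sketch (on synthetic stand-in data, since the result depends on the dataset) of how to inspect this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in for X_original.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X)

# If the first two components retain little variance, a 2-D scatter
# can hide most of the structure in the data.
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("Total variance retained:", pca.explained_variance_ratio_.sum().round(3))
```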

The t-SNE and PCA visualizations demonstrate the complexity of the dataset: no clear separation between stroke and non-stroke cases is visible. The overlapping points across both classes indicate that the patterns needed to separate the classes are not easily captured in a low-dimensional space. This highlights a key challenge in analyzing medical data: the subtle patterns that signal critical outcomes, like a stroke, can often elude human perception.

However, this is precisely where machine learning excels. Despite the visual difficulty of separating the classes, our model demonstrated its ability to detect patterns in the data that are not immediately visible to humans. When evaluated on the original test set, the model achieved a recall of 0.29 and a precision of 0.14 for the minority class (stroke), correctly predicting 14 stroke cases. While these numbers may appear modest, they represent a significant step forward in identifying stroke cases.

This result underscores the power of machine learning to find meaningful patterns in complex, high-dimensional data that humans would struggle to identify. By leveraging advanced algorithms, the model was able to achieve a balanced performance, capturing some stroke cases even in a challenging, imbalanced dataset.

Conclusion¶

This project focused on predicting stroke using machine learning, addressing key challenges like data imbalance and complexity. The process began with thorough data preprocessing, including outlier removal to reduce noise, standardization of continuous features to ensure consistency, and Random Forest-based imputation for missing smoking status values. To tackle the imbalance in the dataset, we applied SMOTE to generate synthetic examples of stroke cases, allowing the model to better detect this minority class.

We trained a Random Forest classifier to predict stroke, which achieved an overall accuracy of 88% on the original test set. The model successfully identified some stroke cases, with a recall of 0.29 and precision of 0.14 for the minority class. While these numbers highlight the difficulty of the problem, the model’s ability to identify 14 stroke cases, which were otherwise completely missed without SMOTE, demonstrates its potential. The ROC-AUC score of 0.7918 further indicates the model’s moderate success in distinguishing stroke cases from non-stroke cases, highlighting the strengths of machine learning in detecting subtle patterns where human analysis might struggle.

Through this analysis, we identified age, BMI, and glucose levels as the three most crucial factors influencing stroke prediction.

For future work, it would be beneficial to train the model on a dataset with a more realistic class balance. While SMOTE helped address the imbalance in this study, relying on synthetic data may not fully represent the complexity of real-world stroke cases. A dataset with a naturally balanced distribution could improve the reliability of the results and offer a more accurate assessment of the model’s performance. Further exploration of alternative algorithms, ensemble methods, and advanced feature engineering could also enhance prediction accuracy and recall.
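Another avenue worth exploring is decision-threshold tuning: rather than (or in addition to) resampling, the probability cutoff for predicting stroke can be lowered to trade precision for recall. A minimal sketch on synthetic imbalanced data (not our stroke dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 95% negatives, like the stroke data.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

rf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)[:, 1]

# Lowering the threshold can only add positive predictions,
# so minority-class recall is non-decreasing as the threshold drops.
for threshold in (0.5, 0.3, 0.1):
    preds = (proba >= threshold).astype(int)
    print(threshold,
          round(precision_score(y_te, preds, zero_division=0), 2),
          round(recall_score(y_te, preds), 2))
```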

This study demonstrates the potential of machine learning in medical applications, specifically in identifying stroke cases and uncovering critical predictive factors. While challenges remain, this work provides a foundation for developing tools that minimize false negatives and support timely stroke detection.